Sequencing and Raw Sequence Data Quality Control ◾ 43
Trimmomatic is available at “http://www.usadellab.org/cms/index.php?page=trimmomatic”.
You can download it from the website and unzip it using the following script:
$ wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
$ unzip Trimmomatic-0.39.zip
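Before running anything, it can help to confirm that the archive unpacked as expected. The following is a minimal sketch, assuming the unzipped directory holds a “trimmomatic-*.jar” file and an “adapters” directory with FASTA files, as is the case for version 0.39. The function name “check_trimmomatic_layout” is a hypothetical helper introduced here for illustration and is not part of Trimmomatic:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Trimmomatic): checks that an unzipped
# Trimmomatic directory contains the JAR and at least one adapter FASTA file.
# Prints the path of the JAR on success; returns 1 otherwise.
check_trimmomatic_layout() {
    local dir="$1"
    local jar fa
    # Look for the executable JAR, e.g. trimmomatic-0.39.jar
    jar=$(find "$dir" -maxdepth 1 -type f -name 'trimmomatic-*.jar' 2>/dev/null | head -n 1)
    if [ -z "$jar" ]; then
        echo "missing: trimmomatic-*.jar in $dir" >&2
        return 1
    fi
    # Look for at least one adapter FASTA file in the adapters directory
    fa=$(find "$dir/adapters" -maxdepth 1 -type f -name '*.fa' 2>/dev/null | head -n 1)
    if [ -z "$fa" ]; then
        echo "missing: adapter FASTA files in $dir/adapters" >&2
        return 1
    fi
    echo "$jar"
}
# Example use:
#   check_trimmomatic_layout Trimmomatic-0.39
```

If the check succeeds, the printed path is the file to pass to “java -jar”.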
Note that the version may change in the future. The unzipped directory is
“Trimmomatic-0.39”, which contains two files (“LICENSE” and “trimmomatic-0.39.jar”)
and a directory (“adapters”). The file “trimmomatic-0.39.jar” is the Java executable
program that performs the preprocessing tasks, and the “adapters” directory contains the
known adaptor sequences in FASTA files. The following script uses Trimmomatic to reprocess
the paired-end FASTQ files, then runs FastQC to generate QC reports, and finally
displays the reports in the Firefox browser:
java -jar ../Trimmomatic-0.39/trimmomatic-0.39.jar \
PE SRR957824_1.fastq SRR957824_2.fastq \
out_PE_SRR957824_1.fastq out_UPE_SRR957824_1.fastq \
out_PE_SRR957824_2.fastq out_UPE_SRR957824_2.fastq \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True \
LEADING:3 \
TRAILING:3 \
ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 \
SLIDINGWINDOW:5:30 \
MINLEN:35
fastqc out_PE_SRR957824_1.fastq out_PE_SRR957824_2.fastq
firefox out_PE_SRR957824_1_fastqc.html out_PE_SRR957824_2_fastqc.html
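The option “PE” selects paired-end mode, and the two paired-end FASTQ files
“SRR957824_1.fastq” and “SRR957824_2.fastq” are provided as inputs. The four file names
that follow are the outputs: for each input file, one file for reads whose mate also
survived (“out_PE_”) and one for reads whose mate was dropped (“out_UPE_”). The adaptor
sequences to be detected and removed from the reads are stored in the “TruSeq3-PE.fa” file
in the “adapters” directory. Hence, “ILLUMINACLIP:TruSeq3-PE.fa” is used to specify the file
in which the adaptor sequences are stored; the command also includes a second
“ILLUMINACLIP” step that uses “TruSeq2-PE.fa”. The values “2:30:10” are the number of seed
mismatches allowed, the palindrome clip threshold, and the simple clip threshold. “LEADING:3”
and “TRAILING:3” remove bases from the start and the end of each read while their Phred
quality score is below 3. “SLIDINGWINDOW:5:30” makes the program scan each read with a
5-base-wide sliding window and cut the read once the average per-base quality score within
the window drops below 30. Finally, “MINLEN:35” removes the reads that are shorter than
35 bases after trimming.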
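One quick way to see the effect of these steps is to summarize the trimmed files from the command line. The sketch below is a hedged illustration, not part of Trimmomatic or FastQC: the function name “fastq_summary” is hypothetical, and it assumes standard four-line FASTQ records with no wrapped sequence lines. It prints the number of reads, the shortest read length, and the longest read length:

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "reads min_len max_len" for a FASTQ file.
# Assumes four-line FASTQ records, so the sequence is line 2 of every 4.
fastq_summary() {
    awk 'NR % 4 == 2 {
             n++
             len = length($0)
             if (n == 1 || len < min) min = len
             if (len > max) max = len
         }
         END { printf "%d %d %d\n", n, min, max }' "$1"
}
# Example use:
#   fastq_summary out_PE_SRR957824_1.fastq
#   fastq_summary out_PE_SRR957824_2.fastq
```

Because the paired output files keep only pairs in which both reads survived, the two “out_PE_” files should report the same number of reads, and with “MINLEN:35” the shortest read should be at least 35 bases.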
In Figures 1.36 and 1.37, notice how the quality of the two files has improved, and
also notice that the total number of sequences is the same in both files. However, the
read lengths vary. If for any reason we need reads of the same length, as some aligners
may require, we can set “MINLEN:” to the maximum length. Since the maximum read length is
150 bases, we can use “MINLEN:151” as follows: